YouTube Data Models
This document describes the YouTube-related data models and processing pipeline used to extract, transform, and consume YouTube video metadata and transcripts. It covers:
Data structures for video metadata and transcript data
Subtitle retrieval and transcript cleaning schemas
Validation and normalization patterns for YouTube URLs and video IDs
Transformation patterns for timestamps and content deduplication
Error handling and fallback strategies during content extraction
Integration points with the YouTube service and prompt chain
The YouTube processing stack spans models, tools, prompts, services, and routers:
Data models define typed structures for video info and request/response payloads
Tools implement URL parsing, metadata extraction, subtitle retrieval, and transcript cleaning
Prompts orchestrate context assembly and LLM prompting
Services coordinate processing and integrate with external APIs
Routers expose endpoints for YouTube queries
Component layers and data flow:
Models: YTVideoInfo (video metadata), VideoInfoRequest (url), SubtitlesRequest (url, lang), SubtitlesResponse (subtitles)
Tools: extract_id.py (extract_video_id(url)), get_info.py (get_video_info(url)), get_subs.py (get_subtitle_content(url, lang)), transcript_generator/__init__.py (processed_transcript(text)), clean.py (clean_transcript(text)), duplicate.py (remove_sentence_repeats(text)), srt.py (clean_srt_text(raw)), timestamp.py (clean_timestamps_and_dedupe(text))
Prompts: youtube.py (youtube_chain, fetch_transcript())
Services: youtube_service.py (YouTubeService.generate_answer(...))
Routers: routers/youtube.py (POST /)
Data flow: VideoInfoRequest feeds get_video_info, which queries YouTube; SubtitlesRequest feeds get_subtitle_content, whose output passes through processed_transcript and its helpers (clean_transcript, remove_sentence_repeats, clean_srt_text, clean_timestamps_and_dedupe); youtube_chain calls get_subtitle_content and processed_transcript; YouTubeService invokes youtube_chain; the router delegates to YouTubeService.
YTVideoInfo: Typed container for YouTube video metadata and optional transcript
VideoInfoRequest: Minimal request for video info endpoint
SubtitlesRequest: Request specifying URL and preferred subtitle language
SubtitlesResponse: Response containing extracted subtitles
extract_video_id: Utility to extract YouTube video IDs from supported URL forms
get_video_info: Orchestrates metadata extraction and optional transcript assembly
get_subtitle_content: Retrieves subtitles with fallback to audio transcription
processed_transcript: Pipeline of transcript cleaning and normalization
youtube_chain: Prompt chain assembling context and invoking the LLM
End-to-end YouTube processing flow:
Router validates inputs and delegates to YouTubeService
Service either uses a direct file-based path (with Google GenAI) or invokes the prompt chain
Prompt chain fetches transcript via get_subtitle_content and processed_transcript
get_video_info enriches metadata and optionally attaches cleaned transcript
Tools handle URL parsing, subtitle retrieval, and robust fallbacks
Data Model: YTVideoInfo
YTVideoInfo defines the canonical video metadata structure used across the system. It includes:
title: Video title with default fallback
description: Video description
duration: Video duration in seconds
uploader: Channel or uploader name
upload_date: ISO-like date string
view_count: Number of views
like_count: Number of likes
tags: List of tags
categories: List of categories
captions: Optional caption content
transcript: Optional cleaned transcript
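The field list above can be sketched as a typed model. This is an illustrative sketch using Python dataclasses; the actual project models may use a different framework (e.g. Pydantic), so the exact base class and validation behavior are assumptions here.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class YTVideoInfo:
    """Canonical video metadata container (illustrative sketch)."""
    title: str = "Unknown"
    description: str = ""
    duration: int = 0             # seconds
    uploader: str = "Unknown"
    upload_date: str = ""         # ISO-like date string
    view_count: int = 0
    like_count: int = 0
    tags: List[str] = field(default_factory=list)
    categories: List[str] = field(default_factory=list)
    captions: Optional[str] = None
    transcript: Optional[str] = None
```

The optional `captions` and `transcript` fields default to null so consumers can distinguish "not fetched" from "fetched but empty".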
Data Model: Requests and Responses
VideoInfoRequest: Minimal payload requiring a URL for video info retrieval
SubtitlesRequest: Payload requiring a URL and optional language preference
SubtitlesResponse: Response carrying the extracted subtitles
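The request/response payloads above are minimal. A sketch using dataclasses, with field names taken from this document (the real models may be Pydantic schemas backing the router):

```python
from dataclasses import dataclass

@dataclass
class VideoInfoRequest:
    url: str

@dataclass
class SubtitlesRequest:
    url: str
    lang: str = "en"   # preferred subtitle language

@dataclass
class SubtitlesResponse:
    subtitles: str
```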
URL Parsing and Validation: extract_video_id
The extract_video_id utility parses YouTube URLs and supports:
youtube.com/watch?v=VIDEO_ID
youtu.be/VIDEO_ID
It returns the extracted video ID or None on failure.
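A minimal sketch of this parsing behavior, assuming regex-based matching on the two URL forms named above; the exact pattern set used in extract_id.py is an assumption:

```python
import re
from typing import Optional

# Matches watch URLs and short links; video IDs are 11 characters
# drawn from [A-Za-z0-9_-].
_YT_PATTERNS = [
    re.compile(r"(?:https?://)?(?:www\.)?youtube\.com/watch\?(?:[^#]*&)?v=([\w-]{11})"),
    re.compile(r"(?:https?://)?youtu\.be/([\w-]{11})"),
]

def extract_video_id(url: str) -> Optional[str]:
    """Return the 11-character video ID, or None if the URL is unsupported."""
    for pattern in _YT_PATTERNS:
        match = pattern.search(url)
        if match:
            return match.group(1)
    return None
```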
Metadata Extraction: get_video_info
get_video_info uses yt-dlp to extract metadata and optionally attach a cleaned transcript:
Builds yt-dlp options to avoid downloads and warnings
Extracts metadata fields into YTVideoInfo-compatible dictionary
Attempts subtitle retrieval and applies error detection
Applies processed_transcript to normalize and clean the transcript
Returns YTVideoInfo with optional transcript
Subtitle Retrieval and Fallback: get_subtitle_content
Subtitle retrieval follows a prioritized strategy:
Single-pass attempt for preferred language (manual + auto-generated + auto-translated)
Alternative language selection from available tracks
One retry for alternative language if needed
Fallback to audio download and transcription via faster-whisper
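The prioritized strategy above amounts to a nested loop over languages and subtitle sources, with transcription as the last resort. A sketch with injected callables (the function and parameter names here are illustrative, not the project's real signature):

```python
from typing import Callable, Optional, Sequence

# (url, lang) -> subtitle text, or None when that source has no track
Fetcher = Callable[[str, str], Optional[str]]

def fetch_with_fallback(
    url: str,
    preferred_lang: str,
    subtitle_fetchers: Sequence[Fetcher],
    alt_langs: Sequence[str],
    transcribe_audio: Callable[[str], str],
) -> str:
    """Try subtitle sources for the preferred language, then alternative
    languages, and finally fall back to audio transcription."""
    for lang in [preferred_lang, *alt_langs]:
        for fetch in subtitle_fetchers:  # e.g. manual, auto-generated, auto-translated
            subs = fetch(url, lang)
            if subs:
                return subs
    return transcribe_audio(url)  # faster-whisper in the real pipeline
```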
Transcript Cleaning Pipeline: processed_transcript
The processed_transcript pipeline normalizes and cleans raw subtitle text:
clean_srt_text: Removes SRT/VTT timestamp lines and artifacts
clean_timestamps_and_dedupe: Strips arrow-based timestamps and inline cues, deduplicates lines
clean_transcript: Reflows text into paragraphs and removes cue and speaker tags
remove_sentence_repeats: Collapses repeated sentences
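Two representative stages of this pipeline, sketched under the assumption of standard SRT/VTT input (the real clean.py, srt.py, duplicate.py, and timestamp.py implementations may differ in details):

```python
import re

def clean_srt_text(raw: str) -> str:
    """Drop SRT/VTT cue numbers and timestamp lines, keep spoken text."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.isdigit():           # blank or cue number
            continue
        if re.match(r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->", line):
            continue                             # "00:00:01,000 --> 00:00:02,000"
        if line.upper() == "WEBVTT":
            continue
        kept.append(line)
    return "\n".join(kept)

def dedupe_lines(text: str) -> str:
    """Collapse consecutive duplicate lines, common in auto-captions."""
    out = []
    for line in text.splitlines():
        if not out or out[-1] != line:
            out.append(line)
    return "\n".join(out)

def processed_transcript(text: str) -> str:
    return dedupe_lines(clean_srt_text(text))
```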
Prompt Chain and Answer Generation: youtube_chain
The prompt chain composes context from fetched transcripts and invokes the LLM:
fetch_transcript: Retrieves and cleans transcript, handling known error conditions
get_context: Supplies transcript to the chain
youtube_chain: Assembles prompt with context, question, and chat history
youtube_service: Invokes the chain, or uses the file-based path when an attachment is present
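The composition above can be illustrated with plain functions and an injected LLM callable. This is a framework-free sketch; the real youtube_chain is a prompt chain whose template and invocation API are not shown in this document:

```python
from typing import Callable, List, Tuple

def build_prompt(context: str, question: str, history: List[Tuple[str, str]]) -> str:
    """Assemble the prompt from transcript context, question, and chat history."""
    history_block = "\n".join(f"{role}: {msg}" for role, msg in history)
    return (
        f"Transcript context:\n{context}\n\n"
        f"Chat history:\n{history_block}\n\n"
        f"Question: {question}"
    )

def youtube_chain(
    fetch_transcript: Callable[[str], str],
    llm: Callable[[str], str],
    url: str,
    question: str,
    history: List[Tuple[str, str]],
) -> str:
    context = fetch_transcript(url)  # get_subtitle_content + processed_transcript
    return llm(build_prompt(context, question, history))
```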
Router depends on YouTubeService for processing
YouTubeService depends on the prompt chain and optional GenAI SDK
Prompt chain depends on get_subtitle_content and processed_transcript
get_video_info depends on get_subtitle_content and processed_transcript
get_subtitle_content depends on yt-dlp and faster-whisper for fallback
Minimize redundant subtitle downloads by preferring single-pass retrieval and caching cleaned transcripts where appropriate
Use language prioritization to reduce fallback attempts
Favor CPU-based whisper models for resource-constrained environments
Avoid unnecessary file writes by cleaning temporary directories promptly after processing
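Prompt cleanup of temporary audio files can be guaranteed with a context-managed temp directory. A sketch with hypothetical `download_audio` and `transcribe` callables standing in for the project's yt-dlp and faster-whisper steps:

```python
import tempfile
from pathlib import Path
from typing import Callable

def transcribe_with_cleanup(
    url: str,
    download_audio: Callable[[str, Path], Path],
    transcribe: Callable[[Path], str],
) -> str:
    """Keep audio artifacts inside a temp dir that is removed automatically."""
    with tempfile.TemporaryDirectory(prefix="yt_audio_") as tmp:
        audio_path = download_audio(url, Path(tmp))  # writes into tmp
        return transcribe(audio_path)
    # tmp and its contents are gone here, even if transcribe() raised
```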
Common issues and resolutions:
Video unavailable: Detected by known error messages and DownloadError variants; fallback to whisper transcription is triggered
Rate limiting (429): Detected and triggers whisper fallback
No subtitles available: Falls back to audio download and transcription
Known error prefixes: Recognized and treated as non-transcript errors
Cleanup failures: Temporary directories are removed with logging on cleanup errors
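The error-detection side of this table can be sketched as a predicate over fetched content. The prefix list below is hypothetical; the real set of known error messages lives in the project's error handling:

```python
# Hypothetical prefixes; the real list is defined by the project.
KNOWN_ERROR_PREFIXES = ("Video unavailable", "ERROR:", "Sign in to confirm")

def is_transcript_error(text: str) -> bool:
    """True when fetched content is an error message, not a transcript,
    so the caller should trigger the whisper fallback."""
    stripped = text.strip()
    return (
        not stripped
        or stripped.startswith(KNOWN_ERROR_PREFIXES)
        or ("429" in stripped and "Too Many Requests" in stripped)
    )
```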
Operational checks:
Verify GOOGLE_API_KEY or GEMINI_API_KEY for file-based processing
Confirm yt-dlp and faster-whisper availability and permissions for temp directories
Validate URL formats supported by extract_video_id
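The first operational check can be automated with a small helper, assuming the two environment variable names listed above:

```python
import os

def check_genai_key() -> bool:
    """File-based processing needs one of the Google API keys set."""
    return bool(os.environ.get("GOOGLE_API_KEY") or os.environ.get("GEMINI_API_KEY"))
```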
The YouTube data model and processing pipeline provide a robust, layered approach to extracting and transforming YouTube content. The design emphasizes:
Strongly typed data models for predictable consumption
Comprehensive subtitle retrieval with intelligent fallbacks
A modular transcript cleaning pipeline for normalized content
Clear separation of concerns between routing, service orchestration, and tooling
Practical error handling and cleanup to maintain reliability
Data Model Reference
YTVideoInfo
title: string, default “Unknown”
description: string, default empty
duration: integer, default 0
uploader: string, default “Unknown”
upload_date: string, default empty
view_count: integer, default 0
like_count: integer, default 0
tags: array of strings, default empty
categories: array of strings, default empty
captions: string or null, default null
transcript: string or null, default null
VideoInfoRequest
url: string
SubtitlesRequest
url: string
lang: string, default “en”
SubtitlesResponse
subtitles: string
Validation and normalization rules:
URL parsing supports youtube.com/watch?v=VIDEO_ID and youtu.be/VIDEO_ID
Transcript cleaning removes timestamps, cue tags, speaker tags, and duplicate lines
Error detection treats specific messages and prefixes as non-transcript errors
Fallback to whisper transcription ensures minimal failure impact